PM Accelerator Tech Assessment: Weather Trend Forecasting¶
Objective:¶
I used GlobalWeatherRepository.csv to predict future weather patterns and demonstrate data science skills using a mix of foundational and advanced methods. The dataset, available on Kaggle as the Global Weather Repository, provides daily weather data for cities worldwide and contains 47,162 records and 41 columns, including variables such as temperature, precipitation, wind speed, and air quality metrics.
Data Cleaning and Preprocessing¶
In this phase, I perform data cleaning and preprocessing by handling missing values, removing duplicates, and detecting and removing outliers using the IQR method. Histograms are used to visualize the distribution of each numeric column before and after cleaning, ensuring the dataset is clean, consistent, and ready for further analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("GlobalWeatherRepository.csv")
pd.set_option('display.max_columns', None)
df.head()
| country | location_name | latitude | longitude | timezone | last_updated_epoch | last_updated | temperature_celsius | temperature_fahrenheit | condition_text | wind_mph | wind_kph | wind_degree | wind_direction | pressure_mb | pressure_in | precip_mm | precip_in | humidity | cloud | feels_like_celsius | feels_like_fahrenheit | visibility_km | visibility_miles | uv_index | gust_mph | gust_kph | air_quality_Carbon_Monoxide | air_quality_Ozone | air_quality_Nitrogen_dioxide | air_quality_Sulphur_dioxide | air_quality_PM2.5 | air_quality_PM10 | air_quality_us-epa-index | air_quality_gb-defra-index | sunrise | sunset | moonrise | moonset | moon_phase | moon_illumination | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Kabul | 34.52 | 69.18 | Asia/Kabul | 1715849100 | 2024-05-16 13:15 | 26.6 | 79.8 | Partly Cloudy | 8.3 | 13.3 | 338 | NNW | 1012.0 | 29.89 | 0.0 | 0.00 | 24 | 30 | 25.3 | 77.5 | 10.0 | 6.0 | 7.0 | 9.5 | 15.3 | 277.0 | 103.0 | 1.1 | 0.2 | 8.4 | 26.6 | 1 | 1 | 04:50 AM | 06:50 PM | 12:12 PM | 01:11 AM | Waxing Gibbous | 55 |
| 1 | Albania | Tirana | 41.33 | 19.82 | Europe/Tirane | 1715849100 | 2024-05-16 10:45 | 19.0 | 66.2 | Partly cloudy | 6.9 | 11.2 | 320 | NW | 1012.0 | 29.88 | 0.1 | 0.00 | 94 | 75 | 19.0 | 66.2 | 10.0 | 6.0 | 5.0 | 11.4 | 18.4 | 193.6 | 97.3 | 0.9 | 0.1 | 1.1 | 2.0 | 1 | 1 | 05:21 AM | 07:54 PM | 12:58 PM | 02:14 AM | Waxing Gibbous | 55 |
| 2 | Algeria | Algiers | 36.76 | 3.05 | Africa/Algiers | 1715849100 | 2024-05-16 09:45 | 23.0 | 73.4 | Sunny | 9.4 | 15.1 | 280 | W | 1011.0 | 29.85 | 0.0 | 0.00 | 29 | 0 | 24.6 | 76.4 | 10.0 | 6.0 | 5.0 | 13.9 | 22.3 | 540.7 | 12.2 | 65.1 | 13.4 | 10.4 | 18.4 | 1 | 1 | 05:40 AM | 07:50 PM | 01:15 PM | 02:14 AM | Waxing Gibbous | 55 |
| 3 | Andorra | Andorra La Vella | 42.50 | 1.52 | Europe/Andorra | 1715849100 | 2024-05-16 10:45 | 6.3 | 43.3 | Light drizzle | 7.4 | 11.9 | 215 | SW | 1007.0 | 29.75 | 0.3 | 0.01 | 61 | 100 | 3.8 | 38.9 | 2.0 | 1.0 | 2.0 | 8.5 | 13.7 | 170.2 | 64.4 | 1.6 | 0.2 | 0.7 | 0.9 | 1 | 1 | 06:31 AM | 09:11 PM | 02:12 PM | 03:31 AM | Waxing Gibbous | 55 |
| 4 | Angola | Luanda | -8.84 | 13.23 | Africa/Luanda | 1715849100 | 2024-05-16 09:45 | 26.0 | 78.8 | Partly cloudy | 8.1 | 13.0 | 150 | SSE | 1011.0 | 29.85 | 0.0 | 0.00 | 89 | 50 | 28.7 | 83.6 | 10.0 | 6.0 | 8.0 | 12.5 | 20.2 | 2964.0 | 19.0 | 72.7 | 31.5 | 183.4 | 262.3 | 5 | 10 | 06:12 AM | 05:55 PM | 01:17 PM | 12:38 AM | Waxing Gibbous | 55 |
df['last_updated'] = pd.to_datetime(df['last_updated'])
pd.isnull(df).sum()
country                         0
location_name                   0
latitude                        0
longitude                       0
timezone                        0
last_updated_epoch              0
last_updated                    0
temperature_celsius             0
temperature_fahrenheit          0
condition_text                  0
wind_mph                        0
wind_kph                        0
wind_degree                     0
wind_direction                  0
pressure_mb                     0
pressure_in                     0
precip_mm                       0
precip_in                       0
humidity                        0
cloud                           0
feels_like_celsius              0
feels_like_fahrenheit           0
visibility_km                   0
visibility_miles                0
uv_index                        0
gust_mph                        0
gust_kph                        0
air_quality_Carbon_Monoxide     0
air_quality_Ozone               0
air_quality_Nitrogen_dioxide    0
air_quality_Sulphur_dioxide     0
air_quality_PM2.5               0
air_quality_PM10                0
air_quality_us-epa-index        0
air_quality_gb-defra-index      0
sunrise                         0
sunset                          0
moonrise                        0
moonset                         0
moon_phase                      0
moon_illumination               0
dtype: int64
Here I check for missing values in the dataset by counting the number of null entries in each column; this dataset has none.
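Since this dataset happens to have no nulls, no imputation is needed. Purely as a sketch of what the "handling missing values" step would look like on data that did have gaps, one common approach is median fill for numeric columns and mode fill for categorical ones. The mini-frame below is hypothetical, not drawn from the weather data:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the weather data; the real
# GlobalWeatherRepository.csv has no missing values, so this is only a sketch.
sample = pd.DataFrame({
    "temperature_celsius": [26.6, np.nan, 23.0],
    "condition_text": ["Partly Cloudy", None, "Sunny"],
})

# Numeric gaps: fill with the column median (robust to outliers).
num_cols = sample.select_dtypes(include="number").columns
sample[num_cols] = sample[num_cols].fillna(sample[num_cols].median())

# Categorical gaps: fill with the most frequent value.
sample["condition_text"] = sample["condition_text"].fillna(
    sample["condition_text"].mode()[0]
)

print(sample.isnull().sum().sum())  # 0 missing values remain
```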
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47162 entries, 0 to 47161
Data columns (total 41 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   country                       47162 non-null  object
 1   location_name                 47162 non-null  object
 2   latitude                      47162 non-null  float64
 3   longitude                     47162 non-null  float64
 4   timezone                      47162 non-null  object
 5   last_updated_epoch            47162 non-null  int64
 6   last_updated                  47162 non-null  datetime64[ns]
 7   temperature_celsius           47162 non-null  float64
 8   temperature_fahrenheit        47162 non-null  float64
 9   condition_text                47162 non-null  object
 10  wind_mph                      47162 non-null  float64
 11  wind_kph                      47162 non-null  float64
 12  wind_degree                   47162 non-null  int64
 13  wind_direction                47162 non-null  object
 14  pressure_mb                   47162 non-null  float64
 15  pressure_in                   47162 non-null  float64
 16  precip_mm                     47162 non-null  float64
 17  precip_in                     47162 non-null  float64
 18  humidity                      47162 non-null  int64
 19  cloud                         47162 non-null  int64
 20  feels_like_celsius            47162 non-null  float64
 21  feels_like_fahrenheit         47162 non-null  float64
 22  visibility_km                 47162 non-null  float64
 23  visibility_miles              47162 non-null  float64
 24  uv_index                      47162 non-null  float64
 25  gust_mph                      47162 non-null  float64
 26  gust_kph                      47162 non-null  float64
 27  air_quality_Carbon_Monoxide   47162 non-null  float64
 28  air_quality_Ozone             47162 non-null  float64
 29  air_quality_Nitrogen_dioxide  47162 non-null  float64
 30  air_quality_Sulphur_dioxide   47162 non-null  float64
 31  air_quality_PM2.5             47162 non-null  float64
 32  air_quality_PM10              47162 non-null  float64
 33  air_quality_us-epa-index      47162 non-null  int64
 34  air_quality_gb-defra-index    47162 non-null  int64
 35  sunrise                       47162 non-null  object
 36  sunset                        47162 non-null  object
 37  moonrise                      47162 non-null  object
 38  moonset                       47162 non-null  object
 39  moon_phase                    47162 non-null  object
 40  moon_illumination             47162 non-null  int64
dtypes: datetime64[ns](1), float64(23), int64(7), object(10)
memory usage: 14.8+ MB
Getting a summary of the dataframe, specifically the number of non-null values and data types.
print(df.duplicated().sum())
0
This checks how many duplicate rows are in the dataframe; there are none.
def remove_outliers_IQR(df):
    numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
    df_clean = df.copy()
    outlier_counts = {}
    for col in numeric_cols:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Count outliers before filtering
        outliers_before = len(df_clean[(df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)])
        # Drop rows outside the IQR bounds
        df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]
        # Confirm no outliers remain for this column
        outliers_after = len(df_clean[(df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)])
        outlier_counts[col] = (outliers_before, outliers_after)
    return df_clean, outlier_counts
cleaned_data, outlier_counts = remove_outliers_IQR(df)
for column, (outliers_before, outliers_after) in outlier_counts.items():
    print(f'{column}: Outliers before capping: {outliers_before}, Outliers after capping: {outliers_after}')
latitude: Outliers before capping: 0, Outliers after capping: 0
longitude: Outliers before capping: 3628, Outliers after capping: 0
last_updated_epoch: Outliers before capping: 0, Outliers after capping: 0
temperature_celsius: Outliers before capping: 1530, Outliers after capping: 0
temperature_fahrenheit: Outliers before capping: 1077, Outliers after capping: 0
wind_mph: Outliers before capping: 509, Outliers after capping: 0
wind_kph: Outliers before capping: 31, Outliers after capping: 0
wind_degree: Outliers before capping: 0, Outliers after capping: 0
pressure_mb: Outliers before capping: 1914, Outliers after capping: 0
pressure_in: Outliers before capping: 1652, Outliers after capping: 0
precip_mm: Outliers before capping: 7247, Outliers after capping: 0
precip_in: Outliers before capping: 0, Outliers after capping: 0
humidity: Outliers before capping: 0, Outliers after capping: 0
cloud: Outliers before capping: 0, Outliers after capping: 0
feels_like_celsius: Outliers before capping: 402, Outliers after capping: 0
feels_like_fahrenheit: Outliers before capping: 84, Outliers after capping: 0
visibility_km: Outliers before capping: 3796, Outliers after capping: 0
visibility_miles: Outliers before capping: 0, Outliers after capping: 0
uv_index: Outliers before capping: 0, Outliers after capping: 0
gust_mph: Outliers before capping: 193, Outliers after capping: 0
gust_kph: Outliers before capping: 22, Outliers after capping: 0
air_quality_Carbon_Monoxide: Outliers before capping: 2380, Outliers after capping: 0
air_quality_Ozone: Outliers before capping: 202, Outliers after capping: 0
air_quality_Nitrogen_dioxide: Outliers before capping: 3104, Outliers after capping: 0
air_quality_Sulphur_dioxide: Outliers before capping: 1880, Outliers after capping: 0
air_quality_PM2.5: Outliers before capping: 1322, Outliers after capping: 0
air_quality_PM10: Outliers before capping: 1234, Outliers after capping: 0
air_quality_us-epa-index: Outliers before capping: 2078, Outliers after capping: 0
air_quality_gb-defra-index: Outliers before capping: 1477, Outliers after capping: 0
moon_illumination: Outliers before capping: 0, Outliers after capping: 0
The IQR method computes the interquartile range for each numeric column and filters out rows whose values fall outside Q1 − 1.5·IQR and Q3 + 1.5·IQR. (Despite the word "capping" in the printout, rows are dropped rather than clipped to the bounds.) After applying this method, no outliers remain in the cleaned dataset.
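An alternative to dropping rows is true capping (winsorizing), which replaces extreme values with the IQR bounds and so preserves the row count and time alignment. A minimal sketch with `Series.clip` on a hypothetical toy series, not the weather data:

```python
import pandas as pd

# Hypothetical toy series with one obvious outlier (not weather data).
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# clip() replaces values beyond the bounds with the bounds themselves,
# so no rows are lost.
capped = s.clip(lower=lower, upper=upper)

print(len(capped) == len(s))   # True: same row count
print(capped.max() == upper)   # True: the 95.0 was capped at the upper bound
```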
for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure(figsize=(12, 6))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col} (Before outlier removal)')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
This loop draws a histogram for each numeric column to show how the values are distributed before any cleaning, grouping the data into 30 bins and overlaying a KDE curve to show the overall shape of each distribution.
for col in cleaned_data.select_dtypes(include=[np.number]).columns:
    plt.figure(figsize=(12, 6))
    sns.histplot(cleaned_data[col], kde=True, bins=30)
    plt.title(f'Distribution of {col} (After outlier removal)')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
This code generates histograms with KDE overlays for each numeric column in cleaned_data to visualize their distributions after outlier removal.
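Note that the preprocessing above removes outliers but does not actually normalize the data. If normalization were wanted before modeling, min-max scaling is one standard option; a minimal sketch using scikit-learn's `MinMaxScaler` on a hypothetical two-column frame (not part of the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical two-column frame; in the notebook this would be the
# numeric columns of cleaned_data.
demo = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "humidity": [40, 60, 80]})

# Min-max scaling maps each column onto [0, 1] without changing the
# shape of its distribution.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(demo), columns=demo.columns)

print(scaled["temp"].tolist())  # [0.0, 0.5, 1.0]
```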
Basic EDA¶
In my EDA process, I create visualizations like line plots to observe trends in temperature, precipitation, and other variables, and scatter plots to show relationships. I also use correlation matrices to analyze the connections between weather and air quality data.
plt.figure(figsize=(10, 6))
sns.lineplot(x='last_updated', y='temperature_celsius', data=cleaned_data, color='orange')
plt.title('Temperature Trends Over Time')
plt.xticks(rotation=45)
plt.show()
plt.figure(figsize=(10, 6))
sns.lineplot(x='last_updated', y='precip_mm', data=cleaned_data, color='hotpink')
plt.title('Precipitation Trends Over Time')
plt.xticks(rotation=45)
plt.show()
I created line plots showing temperature and precipitation trends over time, using last_updated as the x-axis. The orange line shows temperature in Celsius rising in the summer and falling in the winter; the pink line shows spikes of rain followed by periods of dryness.
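Plotting every raw observation can make the trend noisy; a common refinement is to resample to a monthly mean before plotting. A minimal sketch on a synthetic daily series (column names mirror the notebook's, but the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for cleaned_data; hypothetical values.
demo = pd.DataFrame({
    "last_updated": pd.date_range("2024-05-16", periods=90, freq="D"),
    "temperature_celsius": 20 + 5 * np.sin(np.arange(90) / 14),
})

# Resample to a monthly mean so the line plot shows the seasonal trend
# rather than every individual observation.
monthly = (
    demo.set_index("last_updated")["temperature_celsius"]
        .resample("MS")
        .mean()
)

print(len(monthly))  # 4 monthly points (May through August 2024)
```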
import plotly.express as px
list_of_countries = ['Portugal', 'Argentina', 'Brazil', 'Germany']
trend = cleaned_data.query("country in @list_of_countries")
fig = px.line(
    trend,
    x='last_updated',
    y='precip_mm',
    title="Precipitation Trends by Country",
    color="country",
    markers=True,
    hover_data=['precip_mm']
)
fig.update_xaxes(title="Timestamp of Observation")
fig.update_yaxes(title="Precipitation (mm)")
fig.update_traces(marker=dict(size=6), line=dict(width=2))
fig.update_layout(
    legend_title=dict(text="Country"),
    height=600,
    width=1000,
    title_font=dict(size=20)
)
fig.show()
Argentina, Germany, and Portugal show relatively low precipitation levels with sporadic high spikes in some months, whereas Brazil shows a significant spike in precipitation around January 2025.
list_of_countries = ['Portugal', 'Argentina', 'Brazil', 'Germany']
trend = cleaned_data.query("country in @list_of_countries")
fig = px.line(
    trend,
    x='last_updated',
    y='temperature_fahrenheit',
    title="Temperature Trends by Country",
    color="country",
    markers=True,
    hover_data=['temperature_fahrenheit']
)
fig.update_xaxes(title="Timestamp of Observation")
fig.update_yaxes(title="Temperature (°F)")
fig.update_traces(marker=dict(size=6), line=dict(width=2))
fig.update_layout(
    legend_title=dict(text="Country"),
    height=600,
    width=1000,
    title_font=dict(size=20)
)
fig.show()
The line graph illustrates temperature trends for Argentina, Germany, Portugal, and Brazil from June 2024 to January 2025: Argentina, Germany, and Portugal fluctuate significantly, while Brazil shows a steady decrease over the same period.
plt.figure(figsize=(10, 6))
sns.scatterplot(x=cleaned_data['wind_kph'], y=cleaned_data['gust_kph'], color = 'hotpink')
plt.title('Wind Speed vs Gust Speed')
plt.xlabel('Wind Speed (kph)')
plt.ylabel('Gust Speed (kph)')
plt.show()
The scatter plot reveals a clear positive linear relationship between wind speed and gust speed, indicating that higher wind speeds are associated with stronger gusts. The close clustering of data points suggests a strong correlation, emphasizing wind speed as a significant factor in gust intensity.
correlation_matrix = cleaned_data[['temperature_celsius', 'humidity', 'wind_kph', 'precip_mm', 'pressure_mb']].corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Weather Variables')
plt.show()
Overall, the correlation matrix suggests that temperature and humidity as well as pressure and temperature have the strongest relationship amongst this group of features, and precipitation and humidity have a moderate positive relationship. The other variables do not appear to be strongly correlated with each other.
air_quality_cols = ['air_quality_PM2.5', 'air_quality_PM10', 'air_quality_Carbon_Monoxide', 'air_quality_Nitrogen_dioxide']
correlation_matrix_air_quality = cleaned_data[air_quality_cols].corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix_air_quality, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Air Quality Indicators')
plt.show()
Air Quality PM2.5 and Air Quality PM10 have a strong positive correlation (0.75). This suggests that when PM2.5 levels are high, PM10 levels are also likely to be high.
Advanced and Basic Model Building/Forecasting¶
I built three models—gradient boosting, random forest, and linear regression—to predict precipitation trends. I evaluated each model using MAE, RMSE, and R2, and then created an ensemble of their predictions to improve accuracy, as part of the basic and advanced forecasting requirements.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
cleaned_data = cleaned_data.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
cleaned_data['last_updated'] = pd.to_datetime(cleaned_data['last_updated'])
cleaned_data['year'] = cleaned_data['last_updated'].dt.year
cleaned_data['month'] = cleaned_data['last_updated'].dt.month
cleaned_data['day'] = cleaned_data['last_updated'].dt.day
cleaned_data['hour'] = cleaned_data['last_updated'].dt.hour
cleaned_data['minute'] = cleaned_data['last_updated'].dt.minute
cleaned_data['weekday'] = cleaned_data['last_updated'].dt.weekday
cleaned_data = cleaned_data.drop(columns=['last_updated'])
features = cleaned_data.drop(columns=['precip_mm', 'anomaly'], errors='ignore')
target = cleaned_data['precip_mm']
features = pd.get_dummies(features, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
models = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Linear Regression": LinearRegression()
}
performance = {}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    predictions[name] = preds
    performance[name] = {
        "MAE": mean_absolute_error(y_test, preds),
        "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
        "R2 Score": r2_score(y_test, preds)
    }
ensemble_preds = np.mean(np.column_stack(list(predictions.values())), axis=1)
performance["Ensemble"] = {
    "MAE": mean_absolute_error(y_test, ensemble_preds),
    "RMSE": np.sqrt(mean_squared_error(y_test, ensemble_preds)),
    "R2 Score": r2_score(y_test, ensemble_preds)
}
print("Model Performance:")
for model_name, scores in performance.items():
    print(f"{model_name}: MAE={scores['MAE']:.3f}, RMSE={scores['RMSE']:.3f}, R2 Score={scores['R2 Score']:.3f}")
Model Performance:
Gradient Boosting: MAE=0.006, RMSE=0.012, R2 Score=0.398
Random Forest: MAE=0.005, RMSE=0.011, R2 Score=0.411
Linear Regression: MAE=0.020, RMSE=0.164, R2 Score=-120.266
Ensemble: MAE=0.010, RMSE=0.056, R2 Score=-13.010
This code trains three models (gradient boosting, random forest, and linear regression) to predict precipitation from timestamp-derived and other features, then evaluates each with MAE, RMSE, and R2. It also averages the three models' predictions into a simple ensemble; note that linear regression's very poor fit (negative R2) drags the ensemble well below the tree-based models.
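One caveat with this setup: train_test_split shuffles randomly, so the test set contains timestamps that fall inside the training period, which can inflate scores on time-stamped data. A chronological split is a common alternative; a minimal sketch on a small hypothetical frame (not the real features):

```python
import numpy as np
import pandas as pd

# Hypothetical time-ordered frame; in the notebook this would be
# cleaned_data sorted by last_updated before feature extraction.
demo = pd.DataFrame({
    "t": pd.date_range("2024-05-16", periods=10, freq="D"),
    "precip_mm": np.linspace(0.0, 0.9, 10),
}).sort_values("t")

# Chronological 80/20 split: train on the past, test on the future,
# instead of shuffling randomly with train_test_split.
cut = int(len(demo) * 0.8)
train, test = demo.iloc[:cut], demo.iloc[cut:]

print(train["t"].max() < test["t"].min())  # True: no future data in training
```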
Advanced EDA¶
I used Isolation Forest to detect outliers in temperature and precipitation data. After identifying the outliers, I visualized them in plots with the outliers marked in red for temperature and orange for precipitation, as part of the advanced EDA for anomaly detection.
from sklearn.ensemble import IsolationForest
numeric_columns = df.select_dtypes(include=[np.number]).columns
iso_forest = IsolationForest(contamination=0.05, random_state=42)
df['outlier'] = iso_forest.fit_predict(df[numeric_columns])
outliers = df[df['outlier'] == -1]
print(f"Total number of detected outliers: {len(outliers)}")
fig, (ax_temp, ax_precip) = plt.subplots(2, figsize=(12, 10))
ax_temp.plot(df.index, df['temperature_fahrenheit'], label='Temperature', color='blue')
ax_temp.scatter(outliers.index, outliers['temperature_fahrenheit'], color='red', label='Outliers')
ax_temp.set_title("Outliers in Temperature Data")
ax_temp.set_xlabel("Time")
ax_temp.set_ylabel("Temperature (°F)")
ax_temp.legend()
ax_precip.plot(df.index, df['precip_mm'], label='Precipitation', color='green')
ax_precip.scatter(outliers.index, outliers['precip_mm'], color='orange', label='Outliers')
ax_precip.set_title("Outliers in Precipitation Data")
ax_precip.set_xlabel("Time")
ax_precip.set_ylabel("Precipitation (mm)")
ax_precip.legend()
plt.tight_layout()
plt.show()
Total number of detected outliers: 2359
I applied the Isolation Forest method to find outliers in the temperature and precipitation data, then plotted each series over time with outliers marked in red (temperature) and orange (precipitation).
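The contamination parameter controls what fraction of rows Isolation Forest flags. A minimal self-contained sketch on synthetic 1-D data (not the weather dataset), showing that contamination=0.05 flags roughly the 5% most anomalous points:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 1-D sketch (not the weather data): 95 readings near 20 °C
# plus 5 injected extremes.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(20, 1, 95), [45.0, 50.0, -30.0, 60.0, -25.0]])
X = X.reshape(-1, 1)

# contamination=0.05 asks the model to flag ~5% of rows, matching the
# setting used on the weather data above.
labels = IsolationForest(contamination=0.05, random_state=42).fit_predict(X)

print((labels == -1).sum())  # the 5 injected extremes are flagged
```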
Advanced Climate Analysis:¶
I selected countries from different continents and plotted their temperature trends over time. Each plot shows the temperature changes for each country, helping analyze long-term climate patterns in various regions.
countries_to_plot = {
    "North America": ["United States of America", "Canada", "Mexico"],
    "South America": ["Brazil", "Argentina", "Haiti"],
    "Europe": ["Germany", "France", "Italy"],
    "Africa": ["Nigeria", "South Africa", "Egypt"],
    "Asia": ["Sri Lanka", "North Korea", "Vietnam"],
    "Oceania": ["Australia", "New Zealand"]
}
for continent, countries in countries_to_plot.items():
    subset_data = df[df['country'].isin(countries)]
    plt.figure(figsize=(12, 6))
    sns.lineplot(x='last_updated', y='temperature_celsius', hue='country', data=subset_data)
    plt.title(f'Long-Term Climate Patterns in {continent}')
    plt.xlabel('Date')
    plt.ylabel('Temperature (°C)')
    plt.legend(title='Country', loc='upper right')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
I selected countries from each continent and plotted their temperature trends over time. Each plot shows how temperature has changed, with different colors for each country.
Advanced Environmental Impact Analysis:¶
I analyzed the relationship between air quality and weather parameters, such as temperature and humidity. The scatter plots show weak or no strong correlation between temperature and carbon monoxide levels, and humidity and PM2.5 levels.
plt.figure(figsize=(10, 6))
sns.scatterplot(x='temperature_celsius', y='air_quality_Carbon_Monoxide', data=cleaned_data, color = "teal")
plt.title('Temperature vs Carbon Monoxide Levels')
plt.show()
The scatter plot shows the relationship between temperature and carbon monoxide levels, and the data suggests that there is no strong correlation between the two variables.
plt.figure(figsize=(10, 6))
sns.scatterplot(x='humidity', y='air_quality_PM2.5', data=cleaned_data, alpha=0.7, color='purple')
plt.title('Humidity vs PM2.5 Levels')
plt.xlabel('Humidity (%)')
plt.ylabel('PM2.5 Levels')
plt.tight_layout()
plt.show()
The scatter plot shows the relationship between humidity and PM2.5 levels; the data suggests there is no strong correlation between the two variables.
Advanced Feature Importance¶
I used a RandomForestRegressor to assess feature importance in predicting temperature. The bar plot shows how each feature (precipitation, humidity, wind speed, and pressure) contributes to the model's prediction of temperature.
features = cleaned_data[['precip_mm', 'humidity', 'wind_kph', 'pressure_mb']]
target = cleaned_data['temperature_celsius']
model = RandomForestRegressor()
model.fit(features, target)
importance = pd.Series(model.feature_importances_, index=features.columns)
importance.plot(kind='bar', title='Key Features Influencing Temperature')
plt.show()
Humidity has the strongest influence on temperature, wind speed and pressure have moderate influence, and precipitation has the weakest.
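Impurity-based `feature_importances_` can be biased toward high-variance or high-cardinality features, so scikit-learn's `permutation_importance` is a useful cross-check. A minimal sketch on synthetic data (not the weather dataset), where only the first feature drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in: the target depends almost entirely on the first
# feature, loosely mirroring the humidity-dominated result above.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(random_state=42).fit(X, y)

# Permutation importance shuffles one column at a time and measures the
# drop in R^2, a useful cross-check on impurity-based feature_importances_.
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)

print(result.importances_mean.argmax())  # 0: the informative feature
```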
Advanced Spatial Analysis:¶
I visualized global wind speed patterns by plotting wind speed data on a world map. I used latitude and longitude coordinates to place windspeed data points, with colors showing wind speed variations across different regions.
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(15, 10))
map_plot = Basemap(projection='cyl', resolution='l',
                   llcrnrlat=-90, urcrnrlat=90,
                   llcrnrlon=-180, urcrnrlon=180)
map_plot.drawcoastlines()
map_plot.drawcountries()
lats = cleaned_data['latitude']
lons = cleaned_data['longitude']
temps = cleaned_data['wind_kph']
scatter = map_plot.scatter(lons, lats, c=temps, cmap='coolwarm', marker='o', s=40, alpha=0.7, zorder=5)
plt.colorbar(scatter, label='Wind Speed (kph)')
plt.title('Global Wind Patterns')
plt.show()
The map shows that wind speeds tend to be higher in the Northern Hemisphere, with some areas exceeding 30 kph, whereas the Southern Hemisphere generally sees lower speeds, mostly between 10 and 20 kph, reflecting differences in air circulation between the two hemispheres.
Advanced Geographical Patterns:¶
I generated an interactive global temperature heatmap using Plotly. It shows temperature variations across the world, with yellow representing the hottest areas and purple the coldest, highlighting high temperatures in parts of Africa, the Middle East, and Australia.
import plotly.express as px
fig = px.scatter_geo(
    cleaned_data,
    lat='latitude',
    lon='longitude',
    color='temperature_celsius',
    hover_name='location_name',
    projection='natural earth',
    title='Global Temperature Heatmap'
)
fig.update_geos(showcoastlines=True, coastlinecolor="black", showland=True, landcolor="white")
fig.show()
The heatmap displays global temperatures, with yellow indicating the hottest areas primarily in the tropics, while purple represents the coldest regions. It is interesting to note the high temperatures in parts of Africa, the Middle East, and Australia.